Introspective Fault Tolerance for Exascale Systems∗

نویسندگان

  • Rinku Gupta
  • Kamil Iskra
  • Kazutomo Yoshii
  • Pavan Balaji
  • Pete Beckman
چکیده

Faults and errors are an unavoidable aspect of high performance computing systems. Emerging exascale systems will contain billions of hardware components and complex software stacks. In addition, higher fabrication density and power challenges will further compound fault detection, management and recovery. Efficient fault tolerance and resiliency frameworks are thus of immense importance in the path to the exascale era [7].

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Algorithms and Scheduling Techniques for Exascale Systems

Exascale systems to be deployed in the near future will come with deep hierarchical parallelism, will exhibit various levels of heterogeneity, will be prone to frequent component failures, and will face tight power consumption constraints. The notion of application performance in these systems becomes multi-criteria, with fault-tolerance and power consumption metrics to be considered in additio...

متن کامل

Toward Exascale Resilience

Over the past few years resilience has became a major issue for HPC systems, in particular in the perspective of large Petascale systems and future Exascale ones. These systems will typically gather from half a million to several millions of CPU cores running up to a billion of threads. From the current knowledge and observations of existing large systems, it is anticipated that Exascale system...

متن کامل

Cooperative Application/OS DRAM Fault Recovery

Exascale systems will present considerable fault-tolerance challenges to applications and system software. These systems are expected to suffer several hard and soft errors per day. Unfortunately, many fault-tolerance methods in use, such as rollback recovery, are unsuitable for many expected errors, for example DRAM failures. As a result, applications will need to address these resilience chal...

متن کامل

Fault Tolerance Techniques for Scalable Computing∗

The largest systems in the world today already scale to hundreds of thousands of cores. With plans under way for exascale systems to emerge within the next decade, we will soon have systems comprising more than a million processing elements. As researchers work toward architecting these enormous systems, it is becoming increasingly clear that, at such scales, resilience to hardware faults is go...

متن کامل

A Global Exception Fault Tolerance Model for MPI

Driven both by the anticipated hardware reliability constraints for exascale systems, and the desire to use MPI in a broader application space, there is an ongoing effort to incorporate fault tolerance constructs into MPI. Several faulttolerant models have been proposed for MPI [1], [2], [3], [4]. However, despite these attempts, and over six years of effort by the MPI Forum’s [5] Fault Toleran...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2013